A Data Management Approach for Dataset Selection Using Human Computation

نویسندگان

  • Alexandros Ntoulas
  • Omar Alonso
  • Vasileios Kandylas
چکیده

As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies. Getting labels typically involves employing humans to do the annotation, which directly translates to training and working costs. Crowdsourcing platforms have made labeling cheaper and faster, but they still involve significant costs, especially for the cases where the potential set of candidate data to be labeled is large. In this paper we describe a methodology and a prototype system aiming at addressing this challenge for Web-scale problems in an industrial setting. We discuss ideas on how to efficiently select the data to use for training of machine learning algorithms in an attempt to reduce cost. We show results achieving good performance with reduced cost by carefully selecting which instances to label. Our proposed algorithm is presented as part of a framework for managing and generating training datasets, which includes, among other components, a human computation element.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Evaluation and selection of sustainable suppliers in supply chain using new GP-DEA model with imprecise data

Nowadays, with respect to knowledge growth about enterprise sustainability, sustainable supplier selection is considered a vital factor in sustainable supply chain management. On the other hand, usually in real problems, the data are imprecise. One method that is helpful for the evaluation and selection of the sustainable supplier and has the ability to use a variety of data types is data envel...

متن کامل

Intrusion Detection based on a Novel Hybrid Learning Approach

Information security and Intrusion Detection System (IDS) plays a critical role in the Internet. IDS is an essential tool for detecting different kinds of attacks in a network and maintaining data integrity, confidentiality and system availability against possible threats. In this paper, a hybrid approach towards achieving high performance is proposed. In fact, the important goal of this paper ...

متن کامل

Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection

Due to the rise of technology, the possibility of fraud in different areas such as banking has been increased. Credit card fraud is a crucial problem in banking and its danger is over increasing. This paper proposes an advanced data mining method, considering both feature selection and decision cost for accuracy enhancement of credit card fraud detection. After selecting the best and most effec...

متن کامل

A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of the electrocardiogram signal classification and segmentation methodologies indicates that algorithms which are able to effectively handle the nonstationary and uncertainty of the signals should be used for ECG analysis. Ex...

متن کامل

An Intelligence-Based Model for Supplier Selection Integrating Data Envelopment Analysis and Support Vector Machine

The importance of supplier selection is nowadays highlighted more than ever as companies have realized that efficient supplier selection can significantly improve the performance of their supply chain. In this paper, an integrated model that applies Data Envelopment Analysis (DEA) and Support Vector Machine (SVM) is developed to select efficient suppliers based on their predicted efficiency sco...

متن کامل

Feature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets

Objective(s): This study addresses feature selection for breast cancer diagnosis. The present process uses a wrapper approach using GA-based on feature selection and PS-classifier. The results of experiment show that the proposed model is comparable to the other models on Wisconsin breast cancer datasets. Materials and Methods: To evaluate effectiveness of proposed feature selection method, we ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1307.3673  شماره 

صفحات  -

تاریخ انتشار 2013